This R Markdown document includes contributions by Professor Jessi Kehe.
Files used in these notes:
- COURSE/scripts/viridis.R
- COURSE/data/lions.csv
- COURSE/data/height.txt
- COURSE/data/exoplanets-clean-through-2022.csv
Image of lion noses from https://www.panthera.org/blog/2016/08/17/how-age-lion
library(tidyverse)
lions = read_csv("../../data/lions.csv") %>%
rename(black = proportion.black)
## first few rows
head(lions)
## # A tibble: 6 × 2
## age black
## <dbl> <dbl>
## 1 1.1 0.21
## 2 1.5 0.14
## 3 1.9 0.11
## 4 2.2 0.13
## 5 2.6 0.12
## 6 3.2 0.13
lions_sum = lions %>%
summarize(across(everything(), list(mean = mean, sd = sd)),
n = n(),
r = cor(age, black)) %>%
relocate(n)
lions_sum
## # A tibble: 1 × 6
## n age_mean age_sd black_mean black_sd r
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 32 4.31 2.68 0.322 0.199 0.790
We can add a fitted least-squares line to the scatter plot with geom_smooth().
ggplot(lions, aes(x = age, y = black)) +
geom_point() +
xlab("Age (years)") +
ylab("Percentage Black") +
scale_y_continuous(labels = scales::percent) +
ggtitle("Ages and Lion Nose Color") +
geom_smooth(se = FALSE, method = "lm") +
theme_bw() +
theme(text = element_text(size = 20))
The scatter plot shows a positive association between age and the
proportion of the nose which is black.
The correlation coefficient r is a measure of
the strength of a linear relationship between two quantitative
variables.
\[ r = \mathsf{Corr}(x,y) = \frac{1}{n-1}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_x} \right) \left(\frac{y_i - \bar{y}}{s_y} \right) \]
An equivalent expression divides by \(n\) and uses the standard deviation estimates \(\hat{\sigma}_x\) and \(\hat{\sigma}_y\) that also divide by \(n\):
\[ r = \mathsf{Corr}(x,y) = \frac{1}{n}\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{\hat{\sigma}_x} \right) \left(\frac{y_i - \bar{y}}{\hat{\sigma}_y} \right) \]
ggplot(lions, aes(x = age, y = black)) +
geom_point() +
xlab("Age (years)") +
ylab("Percentage Black") +
scale_y_continuous(labels = scales::percent) +
ggtitle("Ages and Lion Nose Color",
subtitle = "Red dashed lines at variable means") +
geom_vline(xintercept = lions_sum$age_mean, color = "red", linetype = "dashed") +
geom_hline(yintercept = lions_sum$black_mean, color = "red", linetype = "dashed") +
theme_bw() +
theme(text = element_text(size = 20))
We can use cor() to calculate the correlation coefficient.
x = lions %>% pull(age)
y = lions %>% pull(black)
cor(x,y)
## [1] 0.7898272
lions_calc = lions %>%
mutate(z_age = (age - mean(age))/sd(age),
z_black = (black - mean(black))/sd(black),
prod = z_age * z_black)
lions_calc %>%
print(n = Inf)
## # A tibble: 32 × 5
## age black z_age z_black prod
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.1 0.21 -1.20 -0.565 0.677
## 2 1.5 0.14 -1.05 -0.918 0.963
## 3 1.9 0.11 -0.900 -1.07 0.962
## 4 2.2 0.13 -0.788 -0.968 0.763
## 5 2.6 0.12 -0.639 -1.02 0.650
## 6 3.2 0.13 -0.414 -0.968 0.401
## 7 3.2 0.12 -0.414 -1.02 0.422
## 8 2.9 0.18 -0.527 -0.716 0.377
## 9 2.4 0.23 -0.713 -0.464 0.331
## 10 2.1 0.22 -0.825 -0.515 0.425
## 11 1.9 0.2 -0.900 -0.615 0.554
## 12 1.9 0.17 -0.900 -0.766 0.690
## 13 1.9 0.15 -0.900 -0.867 0.781
## 14 1.9 0.27 -0.900 -0.263 0.237
## 15 2.8 0.26 -0.564 -0.313 0.177
## 16 3.6 0.21 -0.265 -0.565 0.150
## 17 4.3 0.3 -0.00350 -0.112 0.000391
## 18 3.8 0.42 -0.190 0.493 -0.0937
## 19 4.2 0.43 -0.0409 0.543 -0.0222
## 20 5.4 0.59 0.407 1.35 0.550
## 21 5.8 0.6 0.557 1.40 0.779
## 22 6 0.72 0.632 2.00 1.27
## 23 3.4 0.29 -0.340 -0.162 0.0551
## 24 4 0.1 -0.116 -1.12 0.129
## 25 7.3 0.48 1.12 0.795 0.888
## 26 7.3 0.44 1.12 0.593 0.663
## 27 7.8 0.34 1.30 0.0897 0.117
## 28 7.1 0.37 1.04 0.241 0.251
## 29 7.1 0.34 1.04 0.0897 0.0935
## 30 13.1 0.74 3.28 2.10 6.91
## 31 8.8 0.79 1.68 2.36 3.95
## 32 5.4 0.51 0.407 0.946 0.385
lions_calc %>%
summarize(r = sum(prod) / (n() - 1)) %>%
pull(r)
## [1] 0.7898272
When \(r=0\), there is a weak linear relationship between \(x\) and \(y\). There may or may not be a strong non-linear relationship between \(x\) and \(y\).
When \(r = 0.97\) there is a strong positive linear relationship between \(x\) and \(y\). There may or may not be an even stronger non-linear relationship between \(x\) and \(y\).
When \(r = -0.97\), there is a strong negative linear relationship between \(x\) and \(y\). There may or may not be an even stronger non-linear relationship between \(x\) and \(y\).
When \(r\) is near 0.7, there is a moderate positive linear relationship between \(x\) and \(y\). There may or may not be a nonlinear relationship which fits the data substantially better.
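To see the first point concretely, here is a small sketch with made-up data where \(y\) is an exact function of \(x\), yet \(r\) is essentially zero:

```r
## Hypothetical data: an exact quadratic relationship, but r is near 0
x = seq(-3, 3, by = 0.1)
y = x^2                 # y is completely determined by x
cor(x, y)               # essentially 0: the symmetric pattern cancels out
```

The correlation fails to detect the relationship because the positive and negative products of standardized values cancel exactly, which is why plotting the data matters.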
Correlation tells us about the strength and direction of a linear relationship between two quantitative variables.
The regression line…
height = read_table("../../data/height.txt")
df = height %>%
filter(age >=2*12 & age <=8*12)
ggplot(df, aes(x = age, y = height)) +
geom_point() +
geom_smooth(se = FALSE, method = "lm") +
ylab("Height (inches)") +
xlab("Age (months)")
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]
\[ \sum_{i=1}^n \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2 \]
We use lm() to fit the linear regression model.
df_lm = lm(height ~ age, data = df)
cf = coef(df_lm)
cf
## (Intercept) age
## 30.2493290 0.2505185
summary(df_lm)
##
## Call:
## lm(formula = height ~ age, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29911 -0.19291 -0.03355 0.21982 0.46334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.249329 0.186745 161.98 <2e-16 ***
## age 0.250519 0.002869 87.33 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2429 on 14 degrees of freedom
## Multiple R-squared: 0.9982, Adjusted R-squared: 0.998
## F-statistic: 7627 on 1 and 14 DF, p-value: < 2.2e-16
\[ \hat{\beta}_1 = r \times \frac{s_y}{s_x} \]
\[ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
x = df$age
y = df$height
xbar = mean(x)
ybar = mean(y)
sx = sd(x)
sy = sd(y)
r = cor(x,y)
c(xbar, ybar, sx, sy, r)
## [1] 61.5625000 45.6718750 21.8661649 5.4829043 0.9990835
b1 = r *sy/sx
b0 = ybar - b1*xbar
c(b0, b1)
## [1] 30.2493290 0.2505185
cf
## (Intercept) age
## 30.2493290 0.2505185
Using observations \((x_1,y_1), \ldots, (x_n,y_n)\), we can estimate \(\beta_0\) and \(\beta_1\) to get our Least-squares regression line as
\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, \] where \(\hat{y}_i\) is the predicted response corresponding to \(x_i\), and \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the estimated intercept and slope, respectively.
Least-squares Estimation: minimize the squared errors with respect to
the unknown parameters.
- This is accomplished by minimizing the following with respect to the
unknown parameters:
\[
\sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n(y_i - (\hat{\beta}_0 +
\hat{\beta}_1 x_i))^2
\]
- This results in the estimators noted above for \(\beta_0\) and \(\beta_1\).
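As a sanity check, the minimization can also be done numerically and compared with the closed-form estimates. This is a sketch on simulated data (the variable names here are illustrative, not from the lecture code):

```r
## Simulated data: true intercept 2, true slope 3
set.seed(1)
x = runif(40, 0, 10)
y = 2 + 3 * x + rnorm(40, 0, 1)

## Sum of squared errors as a function of b = (beta0, beta1)
sse = function(b) sum((y - (b[1] + b[2] * x))^2)

## Numerical minimization of the SSE
num = optim(c(0, 0), sse)$par

## Closed-form estimates: b1 = r * sy / sx, b0 = ybar - b1 * xbar
b1 = cor(x, y) * sd(y) / sd(x)
closed = c(mean(y) - b1 * mean(x), b1)

num
closed   # the two agree closely
```

Both approaches land on the same coefficients, which is exactly what lm() computes internally (via a more numerically stable method).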
The straightforward way to make a prediction is just to plug into the regression equation with the estimated coefficients.
Estimate the boy’s height at age 50 months.
## one way
est = b0 + 50*b1
est
## [1] 42.77525
## using coef()
sum(cf * c(1,50))
## [1] 42.77525
## Visualization of this prediction
ggplot(df, aes(x = age, y = height)) +
geom_point() +
geom_smooth(se = FALSE, method = "lm") +
geom_point(aes(x=50, y=b0 + 50*b1), color="red", size=4) +
geom_vline(xintercept = 50, color="red", linetype="dashed") +
ylab("Height (inches)") +
xlab("Age (months)")
We can also understand this with standard units.
The value \(x=50\) is \(z = (50 - \bar{x})/s_x\) standard deviations from the mean.
x0 = 50
z = (x0 - mean(x))/sd(x)
z
## [1] -0.528785
yhat = mean(y) + r*z*sd(y)
yhat
## [1] 42.77525
lions = read_csv("../../data/lions.csv") %>%
rename(black = proportion.black)
ggplot(lions, aes(x = age, y = black)) +
geom_point() +
geom_smooth(se = FALSE, method = "lm") +
xlab("Age (years)") +
ylab("Proportion of nose that is black")
lions_lm = lm(black ~ age, data = lions)
lions_cf = coef(lions_lm)
lions_cf
## (Intercept) age
## 0.06969626 0.05859115
summary(lions_lm)
##
## Call:
## lm(formula = black ~ age, data = lions)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.20406 -0.07758 -0.01750 0.07913 0.29876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.069696 0.041956 1.661 0.107
## age 0.058591 0.008307 7.053 7.68e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1238 on 30 degrees of freedom
## Multiple R-squared: 0.6238, Adjusted R-squared: 0.6113
## F-statistic: 49.75 on 1 and 30 DF, p-value: 7.677e-08
lions_sum = lions %>%
summarize(across(everything(), list(mean = mean, sd = sd)),
n = n(),
r = cor(age, black)) %>%
relocate(n)
lions_sum = lions_sum %>%
mutate(b1 = r*black_sd/age_sd,
b0 = black_mean - b1*age_mean)
lions_sum %>%
print(width = Inf)
## # A tibble: 1 × 8
## n age_mean age_sd black_mean black_sd r b1 b0
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 32 4.31 2.68 0.322 0.199 0.790 0.0586 0.0697
yhat_1 = sum( c(1,10) * lions_cf )
yhat_1
## [1] 0.6556078
x0 = 10
z = (x0 - lions_sum$age_mean)/lions_sum$age_sd
z
## [1] 2.126077
lions_sum$r*z
## [1] 1.679234
yhat_2 = lions_sum$black_mean + lions_sum$r*z*lions_sum$black_sd
yhat_2
## [1] 0.6556078
We are going to generate a fake data set to use for our example. By using fake data, we can also check how close our regression estimates match the true values we used to simulate our data.
## Generate our fake data set
set.seed(246810)
n = 100 ## sample size
b0 = 1 ## intercept
b1 = 2.5 ## slope
x = runif(n, -3, 10) ## explanatory variable
y = rnorm(n,b0+b1*x,3) ## response variable
tibble(x,y) %>%
ggplot(aes(x = x, y = y)) +
geom_point()
## Estimate our slope and intercept
mx = mean(x)
my = mean(y)
sx = sd(x)
sy = sd(y)
r = cor(x,y) # correlation coef
r
## [1] 0.9567482
cor(x,y)
## [1] 0.9567482
cor(y,x)
## [1] 0.9567482
b1_hat = r *sy/sx
b1_hat
## [1] 2.462811
b0_hat = my - b1_hat*mx
b0_hat
## [1] 0.8643708
We can also use lm() to estimate the slope and
intercept:
df0 = tibble(x=x, y=y)
lm0 = lm(y~x, df0)
summary(lm0)
##
## Call:
## lm(formula = y ~ x, data = df0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5064 -2.0083 0.2062 1.9766 7.2757
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.86437 0.39085 2.211 0.0293 *
## x 2.46281 0.07565 32.557 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.951 on 98 degrees of freedom
## Multiple R-squared: 0.9154, Adjusted R-squared: 0.9145
## F-statistic: 1060 on 1 and 98 DF, p-value: < 2.2e-16
cf = coef(lm0)
cf
## (Intercept) x
## 0.8643708 2.4628111
c(b0, b1)
## [1] 1.0 2.5
One approach for checking whether the linear assumption holds is to create a residual plot. Recall that the residual corresponding to the \(i\)th observation is \(r_i = y_i - \hat{y}_i\).
library(modelr)
df0 = df0 %>%
add_residuals(lm0) %>%
add_predictions(lm0)
ggplot(df0, aes(x=x, y =resid)) +
geom_point() +
xlab("x") +
ylab("Residuals") +
geom_hline(aes(yintercept=0), color="red", linetype = "dashed")
As a comparison, below is a residual plot for a different simulated data set.
Do you see any pattern in this residual plot?
There is a clear pattern. The residuals near the boundaries are positive and the residuals near the middle of the x range are negative.
This suggests that the relationship between y and x is not linear and appears to be quadratic.
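A sketch of how such a pattern can be produced and detected numerically, using simulated data (illustrative names, not the data set plotted above):

```r
## Truth is quadratic, but we fit a linear model
set.seed(42)
x = runif(200, -3, 3)
y = x^2 + rnorm(200, 0, 0.5)
fit = lm(y ~ x)
res = resid(fit)

## Residuals are positive near the edges of the x range
mean(res[abs(x) > 2])
## ... and negative in the middle
mean(res[abs(x) < 1])
```

Comparing average residuals in different regions of \(x\) is a crude numeric version of eyeballing the residual plot; a systematic sign pattern like this signals a lack of fit.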
In previous lectures, we explored data from the NASA Exoplanet Archive. Two variables that are useful for characterizing an exoplanet are mass and radius.
The two most prolific methods for detecting exoplanets are the Transit Method and the Radial Velocity Method.
## Read in the csv file
## Select confirmed planets, rename some variables
planets = read_csv("../../data/exoplanets-clean-through-2022.csv")
The mass-radius relationship of exoplanets is the relationship between exoplanets’ radius, R, and their mass, M.
Modeling the relationship between mass and radius is important for several reasons.
## Number of mass and radius estimates
planets %>%
select(mass, radius) %>%
summarize_all(function(x) sum(!is.na(x)))
## # A tibble: 1 × 2
## mass radius
## <int> <int>
## 1 1397 3973
Let’s take a quick look at what the Mass-Radius relationship looks like for our exoplanet data. We’ll talk later about a way to model these data.
## How many observations do we have with both mass and radius estimates?
planets %>%
select(mass, radius) %>%
drop_na() %>%
nrow()
## [1] 1022
ggplot(planets, aes(radius, mass)) +
geom_point() +
xlab("Radius (Earth Radius)") +
ylab("Mass (Earth Mass)")
It’s hard to see any clear pattern in this plot. Let’s adjust the axis scales to see if that helps.
ggplot(planets, aes(radius, mass)) +
geom_point() +
xlab("Radius (Earth Radius)") +
ylab("Mass (Earth Mass)") +
scale_x_log10() +
scale_y_log10() +
geom_smooth(se=FALSE) +
geom_smooth(method="lm", se=FALSE, color="magenta")
Now we can see a bit more of a relationship between mass and radius on this log10 scale.
The general pattern is that there is a positive association between log10(radius) and log10(mass).
One of the popular models used for the Mass-Radius relationship is a power-law relationship:
\[ y = C \times x^\theta \] where \(y\) is the response variable, \(x\) is the explanatory variable, \(C\) is a scaling factor, and \(\theta\) is the power law coefficient.
power_law = function(theta){
df = tibble(x = seq(0, 10, by = .1),
y =x^theta)
gg = ggplot(df, aes(x,y)) +
geom_line() +
ggtitle(paste0("Power law exponent: ", theta)) +
theme_bw() +
theme(text=element_text(size=20))
return(gg)
}
power_law(1)
power_law(.5)
power_law(2)
While we will consider mass on the y-axis and radius on the x-axis, astronomers will often plot and model these the other way (with radius on the y-axis).
Astronomers have found that the power law relationship between mass and radius is not constant across the range of values, but instead there seems to be a different power law exponent for different mass ranges.
Below we define the data subset we will use for our model and plot mass vs. radius.
mr = planets %>%
filter(between(mass, 2, 127)) %>%
drop_na()
ggplot(mr, aes(radius, mass)) +
geom_point() +
xlab("Radius (Earth Radius)") +
ylab("Mass (Earth Mass)") +
scale_x_log10() +
scale_y_log10() +
geom_smooth(se=FALSE) +
geom_smooth(method="lm", se=FALSE, color="magenta")
We already saw the form of the power law relationship is \(y = C \times x^\theta\). Now we’d like to turn this into a statistical model that we can use for the exoplanet data.
The statistical model we will actually fit is going to use a log10 transformation of the power law relationship in order to make the form linear: \[ \begin{align*} y_i = \log10(m_i) & = \log10(C) + \theta\log10(r_i) + \varepsilon_i \\ & = \log10(C) + \theta x_i + \varepsilon_i, i = 1, \ldots, n. \end{align*} \] In this model, the response variable \(y_i = \log10(m_i)\) is the \(\log10\)(mass) for exoplanet \(i\), the explanatory variable \(x_i = \log10(r_i)\) is the \(\log10\)(radius) for exoplanet \(i\), \(\log10(C)\) is the (unknown) intercept, \(\theta\) is the (unknown) slope, and \(\varepsilon_i\) is the random error for exoplanet \(i\).
lm1 = lm(log10(mass) ~ log10(radius), data = mr)
summary(lm1)
##
## Call:
## lm(formula = log10(mass) ~ log10(radius), data = mr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.21927 -0.19937 -0.02025 0.17505 1.71694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.60223 0.02860 21.06 <2e-16 ***
## log10(radius) 1.10338 0.04724 23.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3299 on 461 degrees of freedom
## Multiple R-squared: 0.542, Adjusted R-squared: 0.541
## F-statistic: 545.5 on 1 and 461 DF, p-value: < 2.2e-16
The estimated intercept is 0.602 and the estimated slope is 1.103.
Let’s check out our fit model on a plot:
ggplot(mr, aes(radius, mass)) +
geom_point() +
xlab("Radius (Earth Radius)") +
ylab("Mass (Earth Mass)") +
scale_x_log10() +
scale_y_log10() +
geom_abline(aes(slope = coef(lm1)[2] , intercept = coef(lm1)[1]), color="blue")
Let’s also see what this looks like on the original scale.
mr %>%
mutate(mass_pred = 10^coef(lm1)[1]*radius^coef(lm1)[2]) %>%
ggplot(aes(radius, mass)) +
geom_point() +
geom_line(aes(y = mass_pred), color="red") +
xlab("Radius (Earth Radius)") +
ylab("Mass (Earth Mass)")
Recall that the estimated slope is the power law exponent. Since it is somewhat near to 1, this suggests that the data on the original scale are not too far from linear in radius.
mr %>%
mutate(mass_pred = 10^coef(lm1)[1]*radius^coef(lm1)[2]) %>%
ggplot(aes(radius, mass)) +
geom_point() +
geom_line(aes(y = mass_pred), color="red") +
xlab("Radius (Earth Radius)") +
ylab("Mass (Earth Mass)") +
geom_smooth(method="lm", se=FALSE, color="blue")
summary(lm(mass ~ radius, data=mr))
##
## Call:
## lm(formula = mass ~ radius, data = mr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.461 -8.857 -3.793 3.528 114.188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.7461 1.4677 -1.871 0.062 .
## radius 6.9904 0.2579 27.100 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.4 on 461 degrees of freedom
## Multiple R-squared: 0.6144, Adjusted R-squared: 0.6135
## F-statistic: 734.4 on 1 and 461 DF, p-value: < 2.2e-16
These models are quite different. We will do some model checking on the appropriateness of our linear model on the log10 scale next.
The linear model we fit uses least squares regression.
library(modelr)
mr = mr %>%
add_residuals(lm1) %>%
add_predictions(lm1)
ggplot(mr, aes(x=radius, y=mass)) +
geom_point() +
geom_segment(aes(xend = radius, yend = 10^pred), color="magenta") +
xlab("Radius (Earth Radius)") +
ylab("Mass (Earth Mass)") +
scale_x_log10() +
scale_y_log10() +
geom_abline(aes(slope = coef(lm1)[2] , intercept = coef(lm1)[1]), color="blue") +
theme_bw() +
theme(text=element_text(size=20))
The vertical magenta bars are the residuals for the model defined as \(r_i = y_i - \hat{y}_i\), where \(\hat{y}_i\) is the predicted \(\log10\)(mass) from our model for exoplanet \(i\).
The blue line shows the lm1 fit.
Next we are going to consider a residual plot. This is where we remove the fitted model from the data and plot only the errors (the residuals) against radius.
ggplot(mr, aes(x=radius, y=resid)) +
geom_point() +
xlab("Radius (Earth Radius)") +
ylab("Residual (Earth Mass)") +
scale_x_log10() +
geom_hline(aes(yintercept = 0), color = "red", linetype = "dashed") +
theme_bw() +
theme(text=element_text(size=20)) +
geom_smooth(se=FALSE)
Patterns in a residual plot can suggest that our linear model may not be appropriate for the data.
Let’s go back to our linear model, \[ y_i = \log10(C) + \theta x_i + \varepsilon_i, i = 1, \ldots, n. \]
We only mentioned that the \(\varepsilon_i\)’s are random errors, but we did not discuss other assumptions.
It is common to make the following assumptions (which should also be checked):
- the errors \(\varepsilon_i\) have mean 0;
- the errors have a common standard deviation \(\sigma\);
- the errors are independent of one another.
When desiring to do inference on the estimated parameters, another common assumption to make is that the errors are normally distributed, \(\varepsilon_i \sim N(0, \sigma)\).
This normality assumption on the errors has implications for the estimates of our parameters. In particular, it has the consequence that the estimators of our intercept and slope are normally distributed.
In a previous discussion assignment you were introduced to Z-scores and t-scores. The t-scores are going to show up in our regression inference, so we review that information here.
Note that in this review, the estimator is the sample mean. When we get back to regression the estimators will be for the slope or intercept, which will result in some changes in the t-statistic and the degrees of freedom of the resulting t-distribution.
Assume a model where \(X_1,\ldots,X_n\) are drawn at random from a distribution with a mean \(\mu\) and a standard deviation \(\sigma\).
The sampling distribution of the sample mean, \(\bar{X} = n^{-1}\sum_{i=1}^n X_i\) has a mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\).
A mathematical derivation is required to show this formally, or simulation can be used to check whether the expressions are plausible.
We have seen in many settings that the z-statistic (subtract the mean, divide by the standard deviation) often has an approximate standard normal distribution. \[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \] If the distribution of each \(X_i\) is normal, then \(Z\) will be normal as well.
Even if \(X_i\) does not have a normal distribution, the distribution of \(\bar{X}\) will be approximately normal if \(n\) is large enough to overcome the non-normality, a result known as the central limit theorem.
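A quick simulation sketch of the central limit theorem, using a skewed (exponential) population:

```r
## Means of n = 50 draws from an Exponential(1) population (mean 1, sd 1)
set.seed(123)
xbar = replicate(10000, mean(rexp(50, rate = 1)))

mean(xbar)   # close to mu = 1
sd(xbar)     # close to sigma / sqrt(n) = 1 / sqrt(50), about 0.141

## The z-statistics behave approximately like a standard normal:
z = (xbar - 1) / (1 / sqrt(50))
mean(abs(z) < 1.96)   # close to 0.95
```

Even though a single exponential draw is strongly skewed, the sample mean of 50 draws is already close to normally distributed.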
However, \(\sigma\) is typically unknown and things are a bit different when the sample standard deviation \(s\) is substituted for \(\sigma\). \[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \] where \[ S = \sqrt{ \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n-1} } \]
The added random variability in the denominator means that even when the distribution of a single random variable is exactly normal, the distribution of \(T\) is not.
Instead, it has a \(t\) distribution with \(n-1\) degrees of freedom, which is a bell-shaped density centered at zero, but more spread out than a standard normal density.
When the degrees of freedom becomes large, the \(t\) distribution is quite close to standard normal.
It is identical to standard normal when the degrees of freedom is infinite.
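We can check this convergence directly by comparing t quantiles (from R's qt(), introduced below) with the standard normal quantile:

```r
## 0.975 quantiles: the t quantile approaches the normal value as df grows
qnorm(0.975)
sapply(c(2, 10, 30, 100, 1000), function(df) qt(0.975, df))
## With df = 2 the quantile is above 4; by df = 1000 it is nearly 1.96
```

Small degrees of freedom require a much larger multiplier to capture 95% of the distribution, which is why t-based confidence intervals are wider for small samples.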
These R functions are similar to their normal counterparts:
- rt(): generate random variables from a t distribution
- pt(): find an area under a t density
- qt(): find a quantile from a t density
- dt(): return the height of the t density
In addition, the following functions are available in the script ggprob.R to add t densities to plots:
- geom_t_density(): add a t density to a plot
- geom_t_fill(): add a filled t density
- gt(): graph a t density
The following graph shows a standard normal distribution in black and t distributions with degrees of freedom equal to 1, 2, 4, 8, …, 1028, ranging from yellow to violet.
A confidence interval for \(\mu\) has the form \[ \bar{x} \pm t^* \frac{s}{\sqrt{n}} \] where \(t^*\) is selected so that the area between \(-t^*\) and \(t^*\) under a t density with \(n-1\) degrees of freedom is the desired confidence level.
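A sketch comparing this formula with t.test() on simulated data:

```r
## Simulated sample
set.seed(7)
x = rnorm(25, mean = 10, sd = 3)

## 95% confidence interval by the formula: xbar +/- t* s / sqrt(n)
tstar = qt(0.975, df = length(x) - 1)
ci_manual = mean(x) + c(-1, 1) * tstar * sd(x) / sqrt(length(x))

## Same interval from t.test()
ci_ttest = t.test(x)$conf.int

ci_manual
ci_ttest   # the two intervals match
```

t.test() is doing exactly this arithmetic under the hood, so the manual and built-in intervals agree to machine precision.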
A confidence interval for a difference between means, \(\mu_1 - \mu_2\), has the form \[
\bar{x}_1 - \bar{x}_2 \pm t^*
\sqrt{ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} }
\] where \(t^*\) is from a \(t\) distribution where the degrees of
freedom is estimated as a function of the sample sizes and standard
deviations. Use the function t.test(). This approach
assumes that the standard deviations of the two populations need not be
the same.
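A sketch of the two-sample version with t.test(), which uses the Welch degrees-of-freedom approximation by default (simulated data):

```r
## Two samples with different standard deviations
set.seed(11)
x1 = rnorm(30, mean = 5, sd = 2)
x2 = rnorm(40, mean = 4, sd = 3)

tt = t.test(x1, x2)   # var.equal = FALSE (Welch) is the default
tt$conf.int           # confidence interval for mu_1 - mu_2
tt$parameter          # estimated (fractional) degrees of freedom

## The standard error in the formula above
se = sqrt(var(x1) / 30 + var(x2) / 40)
se
```

The half-width of the reported interval is qt(0.975, df) times this standard error, with df taken from the Welch approximation rather than n1 + n2 - 2.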
When using t distribution methods, p-values are found by calculating the t statistic and finding areas under t distributions.
These concepts will be used to carry out inference on the estimated parameters of our simple linear model, which we will do next.
If we make the assumption that the errors in the linear model are normally distributed, we can carry out hypothesis tests on our estimated parameters.
We will focus on inference for the slope parameter since that tends to be the more scientifically interesting parameter.
The hypothesis test we will carry out is \[ H_0: \theta = 1 \\ H_a: \theta \neq 1 \]
We test the null that our slope parameter \(\theta\) (which is the power law exponent) is 1, suggesting a linear relationship between \(\log10\)(mass) and \(\log10\)(radius), against the alternative that there is a power law relationship.
This leads to the test statistic \[ T = \frac{\hat{\theta} - 1}{s_{\hat{\theta}}} \] where \(s_{\hat{\theta}}\) is the standard error of \(\hat{\theta}\). Note that the 1 is from the null hypothesis assumption that \(\theta = 1\).
In case you were curious, the formula for this standard error is \[ s_{\hat{\theta}} = \sqrt{\frac{\sum_{i = 1}^n (y_i - \hat{y}_i)^2/(n-2)}{\sum_{i = 1}^n (x_i - \overline{x})^2}} \] where \(\overline{x}\) is the sample mean of the \(x_i\), and \(\hat{y}_i\) is the predicted \(\log10\)(mass) values for observation \(i\).
We can calculate this as follows:
n = nrow(mr)
syy = sum(mr$resid^2)
sxx = sum((log10(mr$radius) - mean(log10(mr$radius)))^2)
sqrt(syy/(n-2)/sxx) ## standard error using the formula above
## [1] 0.04724113
coef(summary(lm1))[2, "Std. Error"] ## standard error from our lm1 model
## [1] 0.04724113
This is also the standard error that would be used in a confidence interval for \(\theta\): \(\hat{\theta} \pm t_{n-2} s_{\hat{\theta}}\), where \(t_{n-2}\) is selected based on the desired confidence level.
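In R, confint() computes this interval directly; a sketch on simulated data (not the exoplanet model itself):

```r
## Simulated regression data
set.seed(5)
x = runif(50, 0, 2)
y = 1 + 1.2 * x + rnorm(50, 0, 0.3)
fit = lm(y ~ x)

## 95% CI for the slope: estimate +/- t_{n-2} * SE
est = coef(summary(fit))[2, "Estimate"]
se  = coef(summary(fit))[2, "Std. Error"]
ci_manual = est + c(-1, 1) * qt(0.975, df = 48) * se

ci_manual
confint(fit, "x", level = 0.95)   # matches the manual calculation
```

Note the t distribution here uses n - 2 = 48 degrees of freedom because two parameters (intercept and slope) are estimated.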
The output of our estimated model gives us the result of (a different) hypothesis test: \[ H_0: \theta = 0 \\ H_a: \theta \neq 0 \]
summary(lm1)
##
## Call:
## lm(formula = log10(mass) ~ log10(radius), data = mr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.21927 -0.19937 -0.02025 0.17505 1.71694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.60223 0.02860 21.06 <2e-16 ***
## log10(radius) 1.10338 0.04724 23.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3299 on 461 degrees of freedom
## Multiple R-squared: 0.542, Adjusted R-squared: 0.541
## F-statistic: 545.5 on 1 and 461 DF, p-value: < 2.2e-16
The estimated slope is \(\hat{\theta} = 1.10338\) with an estimated standard error of \(SE(\hat{\theta})=0.04724\).
This leads to a t-statistic of \[ T = \frac{1.10338 - 0}{0.04724} = 23.36 \]
We see that there are 463 observations in our data set. Since there are two estimated parameters in our linear model (slope and intercept), we use 463 - 2 = 461 degrees of freedom for the t distribution for our slope.
## Number of observations
mr %>%
nrow()
## [1] 463
## Compute our p-value
pt(23.36, df=461, lower.tail=FALSE)*2 ## P(T >= 23.36) x 2
The resulting p-value is vanishingly small (far below 2e-16), in agreement with the p-value reported by summary().
Notice that this p-value is very small leading us to reject the null hypothesis that the slope is zero.
To carry out our hypothesis test with \[ H_0: \theta = 1 \\ H_a: \theta \neq 1 \] we use the following t-statistic: \[ T = \frac{1.10338 - 1}{0.04724} = 2.188302 \]
## t-statistic
tstat = (coef(summary(lm1))[2]-1)/coef(summary(lm1))[2, "Std. Error"]
tstat
## [1] 2.188302
## Compute our p-value
pt(tstat, df=nrow(mr)-2, lower.tail=FALSE)*2 ## P(T >= 2.188302) x 2
## [1] 0.02914877
Since the p-value is less than 0.05, there is evidence to reject the null hypothesis. In other words, the results suggest that a power law exponent of 1 is not adequate, in favor of a power law exponent greater than 1 (p-value = 0.029, two-sided t-test).
We can create a plot to visualize the p-value:
gt(nrow(mr)-2) +
geom_vline(aes(xintercept = c(-abs(tstat), tstat)), color="red", linetype="dashed") +
geom_t_fill(nrow(mr)-2, a = abs(tstat), b=6) +
geom_t_fill(nrow(mr)-2, a=-6, b = -abs(tstat))
Instead of relying on the theoretical values for the standard deviation of the parameters, we could instead run a simulation.
We can use the parametric bootstrap, which involves generating many realizations of the data using the initial estimate for \(\theta\) and then fitting the regression model on the simulated data set to obtain many estimates of \(\theta\). The mean and standard deviation of these values can be used for inference.
Here are the steps for the parametric bootstrap:
- Fit the model to the observed data and record the fitted values and the residual standard deviation.
- Simulate a new response for each observation by adding normal noise (with the residual standard deviation) to the fitted value.
- Refit the regression model to each simulated data set and save the estimated coefficients.
- Use the mean and standard deviation of the saved estimates for inference.
n = mr %>%
select(mass, radius) %>%
drop_na() %>%
nrow()
sigma_resid = sd(mr$resid)
N = 100 # 10000
b_0 = numeric(N)
b_1 = numeric(N)
for ( i in 1:N )
{
mr_new = mr %>%
drop_na() %>%
mutate(mass = 10^(pred + rnorm(n,0,sigma_resid))) # these are the "new" y's
lm2 = lm(log10(mass) ~ log10(radius), data=mr_new)
b_0[i] = coef(lm2)[1]
b_1[i] = coef(lm2)[2]
}
df_coef = tibble(b_0,b_1)
Let’s look at the slope parameter. We can estimate the mean and standard deviation from the parametric bootstrap simulation, and then compare it to a normal distribution with the same mean and standard deviation.
mean_slope = mean(df_coef$b_1)
mean_slope
## [1] 1.110855
sd_slope = sd(df_coef$b_1)
sd_slope
## [1] 0.04822574
ggplot(df_coef, aes(x=b_1)) +
geom_density() +
xlab("theta") +
ylab("Density") +
geom_norm_density(mu = mean_slope, sigma = sd_slope, color="blue") +
ggtitle("Exoplanet Mass-Radius Relationship",
subtitle = "Parametric bootstrap distribution of the slope (black), and a normal density (blue)")
The simulated distribution (black curve) follows the normal distribution (blue curve) quite well.
So how do these values compare to what we calculated with the theoretical equations?
paste("Bootstrap: ", round(mean_slope,3), round(sd_slope,3))
## [1] "Bootstrap: 1.111 0.048"
paste("Theoretical: ", round(coef(lm1)[2],3), round(coef(summary(lm1))[2, "Std. Error"],3))
## [1] "Theoretical: 1.103 0.047"
Very close as well!
## Add residuals and predicted values to our data frame
mr = mr %>%
add_residuals(lm1) %>%
add_predictions(lm1)
The estimated intercept is 0.602 and the estimated slope is 1.103.
Let’s take a look at our estimated model on a plot again:
ggplot(mr, aes(radius, mass)) +
geom_point() +
xlab("Radius (Earth Radius)") +
ylab("Mass (Earth Mass)") +
scale_x_log10() +
scale_y_log10() +
geom_abline(aes(slope = coef(lm1)[2] , intercept = coef(lm1)[1]), color="blue") +
theme(text = element_text(size=20))
The estimated regression model is \[ \hat{y}_i = 0.602 + 1.103 x_i. \]
We can use the estimated linear model to predict a mass for a given radius.
Note that the function below assumes the input x is
radius on the original scale and returns an estimated mass on the
original scale.
predict_y = function(x){
## x = radius (on original scale)
slope = coef(lm1)[2]
intercept = coef(lm1)[1]
logy = intercept + slope*log10(x)
y = 10^logy
names(y) = "predicted mass"
return(y)
}
radius_input = 3
mass_predicted = predict_y(3)
mass_predicted
## predicted mass
## 13.44852
We can plot this point as well:
ggplot(mr, aes(radius, mass)) +
geom_point() +
xlab("Radius (Earth Radius)") +
ylab("Mass (Earth Mass)") +
scale_x_log10() +
scale_y_log10() +
geom_abline(aes(slope = coef(lm1)[2] , intercept = coef(lm1)[1]), color="blue") +
geom_point(aes(x = radius_input, y=mass_predicted), color = "red", size = 2) +
theme(text = element_text(size=20))
An interesting and useful feature of the least squares regression line is that it goes through the point \((\bar{x}, \bar{y})\). We can check this with our function as well.
radius = 10^mean(log10(mr$radius)) ## mean of log10(radius), transformed back to original scale for function
radius
## [1] 3.244174
log10(predict_y(radius)) ## log10 of the predicted mass at xbar; should equal ybar
## predicted mass
## 1.166171
print(paste0("(xbar,ybar) = (",round(mean(log10(mr$radius)),3), ", ", round(mean(log10(mr$mass)),3),")"))
## [1] "(xbar,ybar) = (0.511, 1.166)"
We can get our predicted values of mass, but there is uncertainty in this estimate, and it can be desirable to define an interval around the estimate to capture this uncertainty.
There are two common ways to look at this problem of predicting \(\hat{y}\): a confidence interval for the mean response \(E(y \mid x^*)\), or a prediction interval for a single new outcome \(y^*\).
We will focus on the confidence interval version next.
The goal is an uncertainty interval around the parameter \(E(y \mid x^*)\), the expected value of the response given the explanatory variable \(x^*\).
The uncertainty in the estimated \(\hat{y}\) needs to account for the uncertainty in the estimated intercept and estimated slope. The formula for this uncertainty is \[ s_{\hat{y}} = \sqrt{\left((n-2)^{-1} \sum_{i=1}^n(y_i - \hat{y}_i)^2\right)\left(n^{-1} + \frac{(x^* - \overline{x})^2}{\sum_{i = 1}^n (x_i - \overline{x})^2} \right)} \]
where \(x^*\) is the \(\log_{10}\)(radius) value at which the confidence interval of \(E(y \mid x^*)\) is desired. (The function below accepts a radius on the original scale and applies the \(\log_{10}\) transformation internally.)
We can write a function to compute \(s_{\hat{y}}\):
s_yhat = function(x){
## x = radius on original scale
n = nrow(mr)
syy = sum(mr$resid^2)/(n-2)
mean_logx = mean(log10(mr$radius))
sxx = sum((log10(mr$radius) - mean_logx)^2)
out = sqrt(syy*(1/n + (log10(x)-mean_logx)^2/sxx))
return(out)
}
s_yhat(3)
## [1] 0.01541483
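As a hedge against algebra mistakes, the same standard error of the fitted mean is available from `predict()` with `se.fit = TRUE` (this assumes `lm1` from above):

```r
## Cross-check of s_yhat(): predict() returns the standard error of the
## fitted mean on the log10 scale (assumes lm1 from above)
predict(lm1, newdata = data.frame(radius = 3), se.fit = TRUE)$se.fit
## should match s_yhat(3), about 0.0154
```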
Next we add the lower and upper bounds for the 95% confidence interval to
mr, then add them to our plot.
n = nrow(mr)
mr = mr %>%
mutate(y_plus_se = pred + qt(.975, n-2)*s_yhat(radius),
y_minus_se = pred - qt(.975, n-2)*s_yhat(radius))
ggplot(mr, aes(radius, mass)) +
geom_point() +
xlab("Radius (Earth Radius)") +
ylab("Mass (Earth Mass)") +
scale_x_log10() +
scale_y_log10() +
geom_smooth(method="lm", se=TRUE, color="red")+
geom_abline(aes(slope = coef(lm1)[2] , intercept = coef(lm1)[1]), color="blue") +
geom_line(aes(x = radius, y= 10^y_plus_se), color = "red", linetype="dashed") +
geom_line(aes(x = radius, y= 10^y_minus_se), color = "red", linetype="dashed") +
geom_vline(aes(xintercept = 10^(mean(log10(radius)))), color="blue", linetype="dotted") +
theme(text = element_text(size=20))
Notice that our 95% confidence interval outlines the shaded region
when we use geom_smooth(method="lm", se=TRUE)!
It is subtle in the plot, but the confidence interval narrows as \(x^*\) gets closer to the mean of the \(\log_{10}\)(radius).
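The narrowing is easier to see numerically than in the plot. Evaluating `s_yhat()` on a small grid of radii (this assumes `mr` and `s_yhat()` from above) shows the standard error is smallest near \(10^{\bar{x}} \approx 3.24\):

```r
## Standard error of the fitted mean at several radii (assumes mr and
## s_yhat() from above); the minimum is near 10^mean(log10(radius)) ~ 3.24
grid = c(1, 2, 3.24, 5, 10)
round(sapply(grid, s_yhat), 4)
```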
The uncertainty in some random outcome \(y^*\) needs to account for the uncertainty in the estimated intercept and estimated slope, but also the uncertainty related to the randomness of the outcome (from the \(\varepsilon^*\) associated with \(y^*\)).
The formula for this uncertainty is \[ s_{\hat{y}^*} = \sqrt{\left((n-2)^{-1} \sum_{i=1}^n(y_i - \hat{y}_i)^2\right)\left(1 + n^{-1} + \frac{(x^* - \overline{x})^2}{\sum_{i = 1}^n (x_i - \overline{x})^2} \right)} \]
where \(x^*\) is the \(\log_{10}\)(radius) value associated with \(\hat{y}^*\). (Again, the function below takes a radius on the original scale.)
We can write a function to compute \(s_{\hat{y}^*}\):
s_yhatstar = function(x){
## x = radius on original scale
n = nrow(mr)
syy = sum(mr$resid^2)/(n-2)
mean_logx = mean(log10(mr$radius))
sxx = sum((log10(mr$radius) - mean_logx)^2)
out = sqrt(syy*(1 + 1/n + (log10(x)-mean_logx)^2/sxx))
return(out)
}
s_yhatstar(3)
## [1] 0.3302439
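The corresponding interval is also available directly from `predict()` with `interval = "prediction"` (this assumes `lm1` from above); the fit and bounds come back on the \(\log_{10}\)(mass) scale, so we transform with `10^`:

```r
## 95% prediction interval via predict() (assumes lm1 from above);
## columns fit, lwr, upr are on the log10(mass) scale, so transform with 10^
10^predict(lm1, newdata = data.frame(radius = 3), interval = "prediction")
```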
Next we add the lower and upper bounds for the 95% prediction interval to
mr, then add them to our plot.
n = nrow(mr)
mr = mr %>%
mutate(ystar_plus_se = pred + qt(.975, n-2)*s_yhatstar(radius),
ystar_minus_se = pred - qt(.975, n-2)*s_yhatstar(radius))
ggplot(mr, aes(radius, mass)) +
geom_point() +
xlab("Radius (Earth Radius)") +
ylab("Mass (Earth Mass)") +
scale_x_log10() +
scale_y_log10() +
geom_line(aes(x = radius, y= 10^ystar_plus_se), color = "green", linetype="dashed", size= 1.5) +
geom_line(aes(x = radius, y= 10^ystar_minus_se), color = "green", linetype="dashed", size= 1.5) +
geom_smooth(method="lm", se=TRUE, color="red")+
geom_abline(aes(slope = coef(lm1)[2] , intercept = coef(lm1)[1]), color="blue") +
geom_line(aes(x = radius, y= 10^y_plus_se), color = "red", linetype="dashed") +
geom_line(aes(x = radius, y= 10^y_minus_se), color = "red", linetype="dashed") +
geom_vline(aes(xintercept = 10^(mean(log10(radius)))), color="blue", linetype="dotted") +
theme(text = element_text(size=20))
Notice the 95% prediction interval (green dashed lines) is wider than the 95% confidence interval.
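The difference can be quantified by comparing half-widths at a single radius (this assumes `mr`, `s_yhat()`, and `s_yhatstar()` from above); the prediction interval is much wider because its standard error includes the extra residual variance term:

```r
## Half-widths of the two intervals at radius = 3, on the log10 scale
## (assumes mr, s_yhat(), and s_yhatstar() from above)
tcrit = qt(0.975, nrow(mr) - 2)
tcrit * s_yhat(3)       # CI half-width, ~0.030
tcrit * s_yhatstar(3)   # PI half-width, ~0.649
```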